We discuss a platform that has both software and hardware components, and whose purpose is to support research into characterizing and mitigating the sim-to-real gap in robotics and vehicle autonomy engineering. The software is operating-system independent and has three main components: a simulation engine called Chrono, which supports high-fidelity vehicle and sensor simulation; an autonomy stack for algorithm design and testing; and a development environment that supports visualization and hardware-in-the-loop experimentation. The accompanying hardware platform is a 1/6th scale vehicle augmented with reconfigurable mountings for computing, sensing, and tracking. Since this vehicle platform has a digital twin within the simulation environment, one can test the same autonomy perception, state estimation, or controls algorithms, as well as the processors they run on, in both simulation and reality. A demonstration is provided to show the utilization of this platform for autonomy research. Future work will concentrate on augmenting ART/ATK with support for a full-sized Chevy Bolt EUV, which will be made available to this group in the immediate future.
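Given the versatility of generative adversarial networks (GANs), we seek to understand the benefits of using an existing GAN to enhance simulated images and reduce the sim-to-real gap. We conduct the analysis in the context of simulating robot performance and image-based perception. Specifically, we quantify the GAN's ability to reduce the sim-to-real difference in image perception for robotics. Using semantic segmentation, we analyze the sim-to-real difference in training and testing, using nominal and enhanced simulations of an urban environment. As a secondary application, we consider using the GAN to enhance an indoor environment. For this application, object detection is used to analyze the enhancement in training and testing. The results presented quantify the reduction in the sim-to-real gap when using the GAN and illustrate the benefits of its use.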
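This contribution focuses on camera simulation as it comes into play in simulation-based virtual prototyping. We propose a camera model validation methodology based on the performance of a perception algorithm and the context in which that performance is measured. This approach differs from traditional validation of synthetic images, which is typically done at the pixel or feature level and tends to require matched pairs of synthetic and real images. Because acquiring paired images is costly and constrained, the proposed methodology is based on datasets that are not necessarily paired. Within a real dataset A and a simulated dataset B, we find subsets Ac and Bc of similar content and judge, statistically, the response of a perception algorithm to these similar subsets. This validation approach yields a statistical measure of performance similarity, as well as a measure of similarity between the content of A and B. The methodology is demonstrated using images generated with Chrono::Sensor and a scaled autonomous vehicle, with an object detector as the perception algorithm. The results demonstrate the ability to quantify (i) differences between simulated and real data; (ii) the propensity of training methods to mitigate the sim-to-real gap; and (iii) the contextual overlap between two datasets.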
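The unpaired comparison described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not say how content similarity is measured or which statistical test is used, so the sketch uses hypothetical per-image content descriptors, a Euclidean distance threshold (`radius`), and a simple difference of mean detector scores.

```python
import math

def similar_subsets(a_desc, b_desc, radius):
    """Return indices of items in A and in B that have at least one
    neighbor in the other dataset within `radius` (hypothetical criterion)."""
    a_idx = [i for i, a in enumerate(a_desc)
             if any(math.dist(a, b) <= radius for b in b_desc)]
    b_idx = [j for j, b in enumerate(b_desc)
             if any(math.dist(a, b) <= radius for a in a_desc)]
    return a_idx, b_idx

def compare_datasets(a_desc, a_score, b_desc, b_score, radius):
    """Return (content_overlap, performance_gap) for unpaired datasets.

    content_overlap: fraction of all images that fall in the similar subsets.
    performance_gap: absolute difference of mean detector scores on the
    similar subsets, or None if either subset is empty.
    """
    a_idx, b_idx = similar_subsets(a_desc, b_desc, radius)
    overlap = (len(a_idx) + len(b_idx)) / (len(a_desc) + len(b_desc))
    if not a_idx or not b_idx:
        return overlap, None
    mean_a = sum(a_score[i] for i in a_idx) / len(a_idx)
    mean_b = sum(b_score[j] for j in b_idx) / len(b_idx)
    return overlap, abs(mean_a - mean_b)

# Toy data: 2D content descriptors and per-image detector scores (made up).
real_desc = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
real_score = [0.9, 0.8, 0.2]
sim_desc = [(0.0, 0.5), (1.2, 0.0), (-20.0, 5.0)]
sim_score = [0.7, 0.6, 0.1]

overlap, gap = compare_datasets(real_desc, real_score, sim_desc, sim_score, 1.0)
```

A small gap on the similar subsets would suggest the perception algorithm responds alike to real and simulated images of comparable content, while the overlap value indicates how much of the two datasets is comparable at all.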
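We describe a software framework and a hardware platform used in tandem for the design and analysis of robot autonomy algorithms in simulation and reality. The software, which is open source, containerized, and operating system (OS) independent, has three main components: a ROS 2 interface to a C++ vehicle simulation framework (Chrono), which provides high-fidelity wheeled/tracked vehicle and sensor simulation; a basic ROS 2-based autonomy stack for algorithm design and testing; and a development ecosystem that enables visualization and hardware-in-the-loop experimentation in perception, state estimation, path planning, and controls. The accompanying hardware platform is a 1/6th scale vehicle augmented with reconfigurable mountings for computing, sensing, and tracking. Its purpose is to allow algorithms and sensor configurations to be physically tested and improved. Since this vehicle platform has a digital twin within the simulation environment, one can test and compare the same algorithms and autonomy stack in simulation and reality. The platform has been built with an eye towards characterizing and managing the simulation-to-reality gap. Herein, we describe how this platform is set up, deployed, and used to improve autonomy for mobile applications.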
Despite its importance for federated learning, continuous learning and many other applications, on-device training remains an open problem for EdgeAI. The problem stems from the large number of operations (e.g., floating point multiplications and additions) and memory consumption required during training by the back-propagation algorithm. Consequently, in this paper, we propose a new gradient filtering approach which enables on-device DNN model training. More precisely, our approach creates a special structure with fewer unique elements in the gradient map, thus significantly reducing the computational complexity and memory consumption of back propagation during training. Extensive experiments on image classification and semantic segmentation with multiple DNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and wide applicability of our approach. For example, compared to SOTA, we achieve up to 19$\times$ speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Finally, our method is easy to implement and deploy; over 20$\times$ speedup and 90% energy savings have been observed compared to highly optimized baselines in MKLDNN and CUDNN on NVIDIA Jetson Nano. Consequently, our approach opens up a new direction of research with a huge potential for on-device training.
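To make the idea of "fewer unique elements in the gradient map" concrete, here is a minimal sketch. It assumes one plausible way to create such structure, replacing each square patch of a 2D gradient map with its mean; the abstract does not specify the actual filtering rule, and the function names are hypothetical.

```python
def filter_gradient(grad, patch):
    """Replace each patch x patch block of a 2D gradient map with its mean.

    This is an assumed, illustrative filter: it keeps the sum of the
    gradient unchanged while reducing the number of distinct values.
    """
    rows, cols = len(grad), len(grad[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            cells = [(r, c)
                     for r in range(r0, min(r0 + patch, rows))
                     for c in range(c0, min(c0 + patch, cols))]
            mean = sum(grad[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out

def count_unique(grad):
    """Number of distinct values in a 2D map."""
    return len({v for row in grad for v in row})

grad = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
filtered = filter_gradient(grad, 2)
```

With a 4x4 map and 2x2 patches, the filtered map holds one value per patch, so downstream products with the gradient can in principle be computed once per patch rather than once per element.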
We present a novel corpus for French dialect identification comprising 413,522 French text samples collected from public news websites in Belgium, Canada, France and Switzerland. To ensure an accurate estimation of the dialect identification performance of models, we designed the corpus to eliminate potential biases related to topic, writing style, and publication source. More precisely, the training, validation and test splits are collected from different news websites, while searching for different keywords (topics). This leads to a French cross-domain (FreCDo) dialect identification task. We conduct experiments with four competitive baselines: a fine-tuned CamemBERT model, an XGBoost model based on fine-tuned CamemBERT features, a Support Vector Machine (SVM) classifier based on fine-tuned CamemBERT features, and an SVM based on word n-grams. Aside from presenting quantitative results, we also analyze the most discriminative features learned by CamemBERT. Our corpus is available at https://github.com/MihaelaGaman/FreCDo.
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
Depth cues are known to be useful for visual perception. However, direct measurement of depth is often impracticable. Fortunately, modern learning-based methods offer promising depth maps by inference in the wild. In this work, we adapt such depth inference models for object segmentation using the objects' ``pop-out'' prior in 3D. The ``pop-out'' is a simple composition prior that assumes objects reside on the background surface. Such a compositional prior allows us to reason about objects in 3D space. More specifically, we adapt the inferred depth maps such that objects can be localized using only 3D information. Such separation, however, requires knowledge about the contact surface, which we learn using the weak supervision of the segmentation mask. Our intermediate representation of the contact surface, and thereby reasoning about objects purely in 3D, allows us to better transfer the depth knowledge into semantics. The proposed adaptation method uses only the depth model without needing the source data used for training, making the learning process efficient and practical. Our experiments on eight datasets of two challenging tasks, namely camouflaged object detection and salient object detection, consistently demonstrate the benefit of our method in terms of both performance and generalizability.
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
Learning continuous image representations has recently gained popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images at arbitrary scales from low-resolution inputs. Existing methods mostly ensemble nearby features to predict the new pixel at any queried coordinate in the SR image. Such a local ensemble suffers from some limitations: i) it has no learnable parameters and it neglects the similarity of the visual features; ii) it has a limited receptive field and cannot ensemble relevant features in a large field which are important in an image; iii) it inherently has a gap with real camera imaging since it only depends on the coordinate. To address these issues, this paper proposes a continuous implicit attention-in-attention network, called CiaoSR. We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features. Furthermore, we embed a scale-aware attention in this implicit attention network to exploit additional non-local information. Extensive experiments on benchmark datasets demonstrate that CiaoSR significantly outperforms the existing single image super-resolution (SISR) methods with the same backbone. In addition, the proposed method also achieves state-of-the-art performance on the arbitrary-scale SR task. The effectiveness of the method is also demonstrated in the real-world SR setting. More importantly, CiaoSR can be flexibly integrated into any backbone to improve the SR performance.
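The limitation of a coordinate-only local ensemble can be seen in a minimal sketch. It assumes the ensemble is a fixed area-weighted (bilinear-style) blend of the four nearest grid features, which is one common form; the abstract does not state the exact weighting, and the function name is hypothetical. Scalar features on a 2D grid stand in for feature vectors.

```python
import math

def local_ensemble(grid, x, y):
    """Blend the four grid values around continuous coordinate (x, y).

    The weights depend only on the query coordinate, not on the feature
    values, and have no learnable parameters. Requires a grid of at
    least 2 x 2 cells; x indexes columns and y indexes rows.
    """
    rows, cols = len(grid), len(grid[0])
    # Clamp the top-left neighbor so all four neighbors stay inside the grid.
    x0 = min(max(int(math.floor(x)), 0), cols - 2)
    y0 = min(max(int(math.floor(y)), 0), rows - 2)
    tx, ty = x - x0, y - y0
    weights = {
        (y0, x0): (1 - tx) * (1 - ty),
        (y0, x0 + 1): tx * (1 - ty),
        (y0 + 1, x0): (1 - tx) * ty,
        (y0 + 1, x0 + 1): tx * ty,
    }
    return sum(w * grid[r][c] for (r, c), w in weights.items())

features = [[0.0, 10.0],
            [20.0, 30.0]]
center = local_ensemble(features, 0.5, 0.5)
```

Because the weights are a function of `(x, y)` alone, two very different feature neighborhoods receive identical blending; replacing these fixed weights with learned, feature-dependent ones is the direction the abstract describes.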